REF: Check before calling DTA._from_sequence instead of catching #29589
Conversation
```
    result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
    result = result.astype(dtype)
except TypeError:
```
I suppose you can end up here with a different dtype that is not convertible to datetime64 (there is no guarantee that your groupby function or aggregation returns the same dtype as the original column). In such a case, we should not raise the error, but rather fall back to the original result, as done now?
AFAICT the relevant functions are all dtype-preserving (well, e.g. sometimes int can be cast to float, but that is not relevant for datetime64). Is there a case that I'm forgetting?
What do you mean by "the relevant functions"? I think this gets used for user-defined functions as well?
Most of the calls that get here go through our cython functions, but you're right that we can also get here from _python_agg_general. But in the case of user-defined functions that may not be dtype-preserving, we shouldn't be doing this casting at all, regardless of whether we pre-check vs. catch.
Not fully sure if it should try this casting or not, but right now it does reach here, so your change has implications for that. Consider this example:
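A minimal sketch of the kind of case in question, assuming columns `a` and `b` and a frozenset-returning UDF (the exact original snippet is not preserved here):

```python
import pandas as pd

# Illustrative frame: column names "a"/"b" and the UDF below are assumptions.
df = pd.DataFrame(
    {
        "a": [1, 1, 2],
        "b": pd.date_range("2012-01-01", periods=3, tz="Europe/Brussels"),
    }
)

# The UDF reduces each group to a frozenset, which is not datetime-like, so
# the cast back via arr._from_sequence(result, dtype="datetime64[ns, UTC]")
# cannot succeed for the tz-aware original dtype.
result = df.groupby("a")["b"].agg(lambda x: frozenset(x))
print(result)
```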
That will fail to convert to datetime, raise a TypeError, and thus now fail with this branch. Now, you are right that we maybe shouldn't try to cast (or at least only if the scalars are actual datetime objects), because if you write a function that returns integers, its result now gets incorrectly cast back to datetime64:
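For the integer case, a plausible sketch (same assumed frame; not necessarily the original snippet):

```python
import pandas as pd

# Illustrative frame: column names "a"/"b" are assumptions.
df = pd.DataFrame(
    {
        "a": [1, 1, 2],
        "b": pd.date_range("2012-01-01", periods=3, tz="Europe/Brussels"),
    }
)

# The UDF returns plain integers (group sizes).  Under the liberal casting
# described above, those integers are pushed back through
# _from_sequence(..., dtype="datetime64[ns, UTC]") and read as epoch
# nanoseconds, so the counts come back as timestamps instead of 2 and 1.
print(df.groupby("a")["b"].agg(lambda x: len(x)))
```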
However, this case is a regression, as it worked correctly in pandas 0.23 (started to fail in 0.24, probably with the datetime arrays refactor).
About the "need" to cast here: e.g. this example would otherwise not have a tz-aware result, I think:
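A plausible sketch of such a case (same assumed frame; the `min` UDF is an assumption):

```python
import pandas as pd

# Illustrative frame: column names "a"/"b" and the min() UDF are assumptions.
df = pd.DataFrame(
    {
        "a": [1, 1, 2],
        "b": pd.date_range("2012-01-01", periods=3, tz="Europe/Brussels"),
    }
)

# The UDF returns tz-aware Timestamps.  The python agg path collects them in
# an object array, so without casting back to the original dtype the result
# would stay object dtype instead of datetime64[ns, Europe/Brussels].
result = df.groupby("a")["b"].agg(lambda x: x.min())
print(result.dtype)
```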
@jorisvandenbossche first, I agree that this should be on hold until we sort out the desired when-to-try-casting behavior. My read of your last two comments suggests that you agree with my previous comment that we should not be re-casting for UDFs?
I agree with this statement.
It would also be a regression to not cast (see the example in my last comment above; that would no longer output the correct dtype if there was no cast back to the original type). So I think we should try some casting, but not in such a liberal way as done now (similar to the "soft conversion" in infer_objects, which will not cast integers to datetime64, but will cast datetime objects).
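For reference, a quick sketch of that soft-conversion behaviour of infer_objects (not code from this PR):

```python
import pandas as pd
from datetime import datetime

# Object-dtype datetime objects are soft-converted to datetime64[ns] ...
dt_objs = pd.Series([datetime(2012, 1, 1), datetime(2012, 1, 2)], dtype=object)
print(dt_objs.infer_objects().dtype)  # datetime64[ns]

# ... but object-dtype integers are inferred as integers, not datetime64.
ints = pd.Series([1, 2, 3], dtype=object)
print(ints.infer_objects().dtype)  # int64
```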
@jorisvandenbossche for non-UDFs the behavior in this PR (and master) is correct. For UDFs you have convinced me that the casting should not be done. Does this accurately reflect your views?
Not really; in my last comment I said: "So I think we should try some casting, but not in such a liberal way as done now." Now, "trying to cast" is always a Pandora's box (but we do that a lot in pandas in all kinds of cases); I am mainly noting that not doing it would give a regression in my example above (#29589 (comment)).
Locally I tried disabling this casting exclusively for UDFs. "So I think we should try some casting, but not in such a liberal way as done now" is vague compared to "do this casting only for non-UDFs". Can you phrase your preferred approach more specifically?
Can you show the result of what you get locally?
After adding a
Does this match the expected/correct output?
Not sure what we do differently; what I tried was to just comment out the full block with casting in `pandas/core/groupby/groupby.py`:

```diff
diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 280f1e88b..ab347b0bd 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -788,29 +788,29 @@ b 2""",
         else:
             dtype = obj.dtype
-        if not is_scalar(result):
-            if is_datetime64tz_dtype(dtype):
-                # GH 23683
-                # Prior results _may_ have been generated in UTC.
-                # Ensure we localize to UTC first before converting
-                # to the target timezone
-                arr = extract_array(obj)
-                try:
-                    result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
-                    result = result.astype(dtype)
-                except TypeError:
-                    # _try_cast was called at a point where the result
-                    # was already tz-aware
-                    pass
-            elif is_extension_array_dtype(dtype):
-                # The function can return something of any type, so check
-                # if the type is compatible with the calling EA.
-
-                # return the same type (Series) as our caller
-                cls = dtype.construct_array_type()
-                result = try_cast_to_ea(cls, result, dtype=dtype)
-            elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
-                result = maybe_downcast_to_dtype(result, dtype)
+        # if not is_scalar(result):
+        #     if is_datetime64tz_dtype(dtype):
+        #         # GH 23683
+        #         # Prior results _may_ have been generated in UTC.
+        #         # Ensure we localize to UTC first before converting
+        #         # to the target timezone
+        #         arr = extract_array(obj)
+        #         try:
+        #             result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
+        #             result = result.astype(dtype)
+        #         except TypeError:
+        #             # _try_cast was called at a point where the result
+        #             # was already tz-aware
+        #             pass
+        #     elif is_extension_array_dtype(dtype):
+        #         # The function can return something of any type, so check
+        #         # if the type is compatible with the calling EA.
+
+        #         # return the same type (Series) as our caller
+        #         cls = dtype.construct_array_type()
+        #         result = try_cast_to_ea(cls, result, dtype=dtype)
+        #     elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
+        #         result = maybe_downcast_to_dtype(result, dtype)
         return result
```

And then I get this (notice that the resulting dtype is not tz-aware here, in contrast with your example):
I just changed the tz-casting branch locally, not the whole block. I'm now working on a new branch that should handle the cases you've mentioned correctly.
@jorisvandenbossche I'm working on adding the examples from here into #29641 and want to confirm the expected output. The results I posted here did not have a name for the Series; should the name be "b"?
Closing in favor of #29641.
Comment in core.apply is unrelated; it just doesn't merit its own PR.